Clustered Communication for Efficient Pipelined Multithreading on Commodity MCPs

Authors

  • Yuanming Zhang
  • Kanemitsu Ootsu
  • Takashi Yokota
  • Takanobu Baba
Abstract

Low inter-core communication overhead is critical if pipelined multithreading (PMT) is to use multi-core processors (MCPs) to improve the performance of general sequential applications. However, conventional communication mechanisms based on software queues incur significant communication overhead, which limits the achievable performance and hinders wide commercial use. Dedicated inter-core communication hardware has been proposed, but it demands chip redesign, is costly, and requires ISA extensions. This paper addresses the problem by proposing a novel clustered communication mechanism that minimizes communication overhead on average. We observe that PMT performance is very sensitive to inter-core communication overhead but insensitive to the amount of parallelism. Based on this observation, very low average communication overheads (ACOs) can be achieved by sacrificing a certain amount of parallelism. The principle of the clustered communication mechanism, and how it reduces the ACOs, are presented in detail. A concurrent lock-free clustered software queue algorithm that applies this mechanism is given to support pipelined communication. The algorithm is evaluated on a four-core AMD Phenom processor, and experimental results show that its communication performance is over 10x faster than that of a conventional software queue; significant PMT performance gains for real applications are therefore achieved.

Keywords: pipelined multithreading; commodity multi-core processors; software queue; clustered communication; low inter-core communication overheads.
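
The trade-off described in the abstract (fewer synchronization events in exchange for some lost parallelism) can be illustrated with a small sketch. This is not the paper's algorithm: it is a single-threaded simulation of a single-producer/single-consumer ring buffer in which each side updates its shared index only once per cluster of items. The names `ClusteredQueue`, `simulate`, `cluster_size`, and `shared_updates` are hypothetical, and a real lock-free version would need atomic operations and memory barriers that are omitted here.

```python
class ClusteredQueue:
    """Illustrative SPSC ring buffer that publishes indices once per cluster.

    Assumption: this only counts writes to the shared indices, which stand in
    for the cache-line transfers a real multi-core queue would pay for.
    """

    def __init__(self, capacity, cluster_size):
        if capacity % cluster_size != 0:
            raise ValueError("capacity must be a multiple of cluster_size")
        self.buf = [None] * capacity
        self.capacity = capacity
        self.cluster_size = cluster_size
        self.shared_tail = 0     # written by producer, read by consumer
        self.shared_head = 0     # written by consumer, read by producer
        self.local_tail = 0      # producer-private position
        self.local_head = 0      # consumer-private position
        self.shared_updates = 0  # number of writes to the shared indices

    def enqueue(self, item):
        """Store an item; return False if the queue looks full."""
        if self.local_tail - self.shared_head >= self.capacity:
            return False
        self.buf[self.local_tail % self.capacity] = item
        self.local_tail += 1
        # Publish to the consumer only when a whole cluster is ready.
        if self.local_tail % self.cluster_size == 0:
            self.flush()
        return True

    def flush(self):
        """Publish any items still held back in a partial cluster."""
        if self.shared_tail != self.local_tail:
            self.shared_tail = self.local_tail
            self.shared_updates += 1

    def dequeue(self):
        """Return the next published item, or None if none is visible."""
        if self.local_head >= self.shared_tail:
            return None
        item = self.buf[self.local_head % self.capacity]
        self.local_head += 1
        # Hand free slots back to the producer once per cluster.
        if self.local_head % self.cluster_size == 0:
            self.shared_head = self.local_head
            self.shared_updates += 1
        return item


def simulate(n_items, cluster_size, capacity=64):
    """Alternate producer and consumer phases; return (items, shared writes)."""
    q = ClusteredQueue(capacity, cluster_size)
    out = []
    produced = 0
    while len(out) < n_items:
        # Producer phase: fill until the queue looks full or input ends.
        while produced < n_items and q.enqueue(produced):
            produced += 1
        if produced == n_items:
            q.flush()  # make the final partial cluster visible
        # Consumer phase: drain everything that has been published.
        item = q.dequeue()
        while item is not None:
            out.append(item)
            item = q.dequeue()
    return out, q.shared_updates
```

Under these assumptions, `simulate(1024, 1)` performs 2048 shared-index writes while `simulate(1024, 16)` performs 128, both delivering the items in order. The cost is that the consumer cannot see an item until its cluster is published, which is the parallelism the abstract proposes to give up.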


Similar resources

Amortizing Software Queue Overhead for Pipelined Inter-Thread Communication

Future chip multiprocessors are expected to contain multiple on-die processing cores. Increased memory system contention and wire delays will result in high inter-core latencies in these processors. Thus, parallelizing applications to efficiently execute on multiple contexts is key to achieving continued performance improvements. Recently proposed pipelined multithreading (PMT) techniques have s...

Pipelined Multithreading Transformations and Support Mechanisms

Even though chip multiprocessors have emerged as the predominant organization for future microprocessors, the multiple on-chip cores do not directly result in improved application performance (especially for legacy applications, which are predominantly sequential C/C++ codes). Consequently, parallelizing applications to execute on multiple cores is essential to their success. Independent multit...

Software-Based Communication Latency Hiding for Commodity Workstation Networks

A variety of latency hiding techniques have been investigated at the hardware level. However, except for multithreading, which may require substantial program structuring effort, other software-based latency hiding methods have not been investigated. In this paper, we consider design alternatives for latency hiding other than multithreading. Furthermore, we present experimental evidence for the vali...

The Synchronized Pipelined Parallelism Model

In many of the current and next generation Chip Multiprocessors (CMP) and Simultaneous Multithreading (SMT) processor systems, the processors share one or more cache levels along with the memory interface. Being shared resources, the caches and the memory interface are critical to the performance of the overall system. So, while these processor systems offer significant potential for parallelis...

The Nexus Approach to Integrating Multithreading and Communication

Lightweight threads have an important role to play in parallel systems: they can be used to exploit shared-memory parallelism, to mask communication and I/O latencies, to implement remote memory access, and to support task-parallel and irregular applications. In this paper, we address the question of how to integrate threads and communication in high-performance distributed-memory systems. We p...

Journal title:

Volume   Issue 

Pages  -

Publication date: 2009